{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Applying Regexes To 10-Ks\n",
    "\n",
    "### Introduction\n",
    "\n",
    "In this notebook you will apply regexes to find useful financial information in 10-Ks. In particular, you will use what you learned in previous lessons to extract text from Items 1A, 7, and 7A."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Getting The HTML File\n",
    "\n",
    "In this lesson, we will be working with the 2018, 10-K from Apple. In the code below, we will use the `requests` library to get the HTML data from this 10-K directly from the SEC website. We will learn more about the `requests` library in a later lesson. We will save the HTML data into a string variable named `raw_10k`, as shown below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import requests\n",
    "import requests\n",
    "\n",
    "# Get the HTML data from the 2018 10-K from Apple\n",
    "r = requests.get('https://www.sec.gov/Archives/edgar/data/320193/000032019318000145/0000320193-18-000145.txt')\n",
    "raw_10k = r.text"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we print the `raw_10k` string we will see that it has many sections. In the code below, we print part of the `raw_10k` string:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<SEC-DOCUMENT>0000320193-18-000145.txt : 20181105\n",
      "<SEC-HEADER>0000320193-18-000145.hdr.sgml : 20181105\n",
      "<ACCEPTANCE-DATETIME>20181105080140\n",
      "ACCESSION NUMBER:\t\t0000320193-18-000145\n",
      "CONFORMED SUBMISSION TYPE:\t10-K\n",
      "PUBLIC DOCUMENT COUNT:\t\t88\n",
      "CONFORMED PERIOD OF REPORT:\t20180929\n",
      "FILED AS OF DATE:\t\t20181105\n",
      "DATE AS OF CHANGE:\t\t20181105\n",
      "\n",
      "FILER:\n",
      "\n",
      "\tCOMPANY DATA:\t\n",
      "\t\tCOMPANY CONFORMED NAME:\t\t\tAPPLE INC\n",
      "\t\tCENTRAL INDEX KEY:\t\t\t0000320193\n",
      "\t\tSTANDARD INDUSTRIAL CLASSIFICATION:\tELECTRONIC COMPUTERS [3571]\n",
      "\t\tIRS NUMBER:\t\t\t\t942404110\n",
      "\t\tSTATE OF INCORPORATION:\t\t\tCA\n",
      "\t\tFISCAL YEAR END:\t\t\t0930\n",
      "\n",
      "\tFILING VALUES:\n",
      "\t\tFORM TYPE:\t\t10-K\n",
      "\t\tSEC ACT:\t\t1934 Act\n",
      "\t\tSEC FILE NUMBER:\t001-36743\n",
      "\t\tFILM NUMBER:\t\t181158788\n",
      "\n",
      "\tBUSINESS ADDRESS:\t\n",
      "\t\tSTREET 1:\t\tONE APPLE PARK WAY\n",
      "\t\tCITY:\t\t\tCUPERTINO\n",
      "\t\tSTATE:\t\t\tCA\n",
      "\t\tZIP:\t\t\t95014\n",
      "\t\tBUSINESS PHONE:\t\t(408) 996-1010\n",
      "\n",
      "\tMAIL ADDRESS:\t\n",
      "\t\tSTREET 1:\t\tONE APPLE PARK WAY\n",
      "\t\tCITY:\t\t\tCUPERTINO\n",
      "\t\tSTATE:\t\t\tCA\n",
      "\t\tZIP:\t\t\t95014\n",
      "\n",
      "\tFORMER COMPANY:\t\n",
      "\t\tFORMER CONFORMED NAME:\tAPPLE COMPUTER INC\n",
      "\t\tDATE OF NAME CHANGE:\t19970808\n",
      "</SEC-HEADER>\n",
      "<DOCUMENT>\n",
      "<TYPE>10-K\n",
      "<SEQUENCE>1\n",
      "<FILENAME>a10-k20189292018.htm\n",
      "<DESCRIPTION>10-K\n",
      "<TEXT>\n",
      "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\" \"http://www.w3.org/TR/html4/loose.dtd\">\n",
      "<html>\n",
      "\t<head>\n",
      "\t\t<!-- Document created using Wdesk 1 -->\n",
      "\t\t<!-- Copyright 2018 Workiva -->\n",
      "\t\t<title>Document</title>\n",
      "\t</head>\n",
      "\t<body style=\"font-family:Times New Roman;font-size:10pt;\">\n",
      "<div><a name=\"s3540C27286EF5B0DA103CC59028B96BE\"></a></div><div style=\"line-height:120%;text-align:center;font-size:10pt;\"><div style=\"padding-left:0px;text-indent:0px;line-height:normal;padding-top:10px;\"><table cellpadding=\"0\" cellspacing=\"0\" style=\"font-family:Times New Roman;font-size:10pt;margin-left:auto;margin-right:auto;width:100%;border-collapse:collapse;text-align:left;\"><tr><td colspan=\"1\"></td></tr><tr><td style=\"width:100%;\"></td></tr><tr><td style=\"vertical-align:bottom;padding-left:2px;padding-top:2px;padding-bottom:2px;padding-right:2px;border-top:1px solid #000000\n"
     ]
    }
   ],
   "source": [
    "print(raw_10k[0:2000])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# TODO: Regexes for Tags\n",
    "\n",
    "For our purposes, we are only interested in the sections that contain the 10-K information. All the sections, including the 10-K, are contained with the `<DOCUMENT>` and `</DOCUMENT>` tags. Each section within the document tags is clearly marked by a `<TYPE>` tag followed by the name of the section. In the code below, write three regular expressions:\n",
    "\n",
    "1. A regex to find the `<DOCUMENT>` tag\n",
    "\n",
    "2. A regex to find the `</DOCUMENT>` tag\n",
    "\n",
    "3. A regex to find all the sections marked by the `<Type>` tag"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# import re module\n",
    "\n",
    "\n",
    "# Write regexes\n",
    "doc_start_pattern = \n",
    "doc_end_pattern = \n",
    "type_pattern = "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# TODO: Create Lists with Span Indices\n",
    "\n",
    "Now, that you have the regexes, use the `.finditer()` method to match the regexes in the `raw_10k`. In the code below, create 3 lists:\n",
    "\n",
    "1. A list that holds the `.end()` index of each match of `doc_start_pattern`\n",
    "\n",
    "2. A list that holds the `.start()` index of each match of `doc_end_pattern`\n",
    "\n",
    "3. A list that holds the name of section from each match of `type_pattern`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create 3 lists with the span idices for each regex\n",
    "doc_start_is = \n",
    "doc_end_is = \n",
    "doc_types = "
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# TODO: Create a Dictionary for the 10-K\n",
    "\n",
    "In the code below, create a dictionary which has the key `10-K` and as value the contents of the `10-K` section found above. To do this, create a loop, to go through all the sections found above, and if the section type is `10-K` then save it to the dictionary. Use the indices in  `doc_start_is` and `doc_end_is`to slice the `raw_10k` file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "document = {}\n",
    "\n",
    "# Create a loop to go through each section type and save only the 10-K section in the dictionary\n",
    "\n",
    "\n",
    "# display the document\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# TODO: Find Item 1A, 7, and 7A\n",
    "\n",
    "Our task now is to use regular expression to find the items of interest. The items in this `document` can be found in four different patterns. For example Item 1A can be found in either of the following patterns:\n",
    "\n",
    "1. `>Item 1A`\n",
    "\n",
    "2. `>Item&#160;1A` \n",
    "\n",
    "3. `>Item&nbsp;1A`\n",
    "\n",
    "4. `ITEM 1A` \n",
    "\n",
    "In the code below write a single regular expression that can match all four patterns for Items 1A, 7, and 7A. Then use the `.finditer()` method to match the regex to `document['10-K']`. Finally create a for loop to print the `matches`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Write the regex\n",
    "regex = \n",
    "\n",
    "# Use finditer to math the regex\n",
    "matches = \n",
    "\n",
    "# Write a for loop to print the matches\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If your regex is written correctly, the only matches above should be those for Items 1A, 7, and 7A. You should notice also, that each item is matched twice. This is because each item appears first in the index and then in the corresponding section. We will now have to remove the matches that correspond to the index. We will do this using Pandas in the next section."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Remove Matches that Correspond to the Index\n",
    "\n",
    "We will remove the matches that correspond to the index using a Pandas Dataframe. We will do this in a couple of steps."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### TODO: Create a Pandas DataFrame\n",
    "\n",
    "In the code below create a pandas dataframe with the following column names: `'item','start','end'`. In the `item` column save the `match.group()` in lower case letters, in the ` start` column save the `match.start()`, and in the `end` column save the ``match.end()`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# import pandas\n",
    "\n",
    "\n",
    "# Matches\n",
    "matches = \n",
    "\n",
    "# Create the dataframe\n",
    "test_df = \n",
    "\n",
    "\n",
    "# Display the dataframe\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### TODO: Eliminate Unnecessary Characters\n",
    "\n",
    "As we can see, our dataframe, in particular the `item` column, contains some unnecessary characters such as `>` and periods `.`. In some cases, we will also get unicode characters such as `&#160;` and `&nbsp;`. In the code below, use the Pandas dataframe method `.replace()` with the keyword `regex=True` to replace all whitespaces, the above mentioned unicode characters, the `>` character, and the periods from our dataframe. We want to do this because we want to use the `item` column as our dataframe index later on."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get rid of unnesesary charcters from the dataframe\n",
    "\n",
    "\n",
    "# display the dataframe\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### TODO: Remove Duplicates\n",
    "\n",
    "Now that we have removed all unnecessary characters form our dataframe, we can go ahead and remove the Item matches that correspond to the index. In the code below use the Pandas dataframe `.drop_duplicates()` method to only keep the last Item matches in the dataframe and drop the rest. Just as precaution make sure that the `start` column is sorted in ascending order before dropping the duplicates."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Drop duplicates\n",
    "pos_dat = \n",
    "\n",
    "# Display the dataframe\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# TODO: Set Item to Index\n",
    "\n",
    "In the code below use the Pandas dataframe `.set_index()` method to set the `item`  column as the index of our dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Set item as the dataframe index\n",
    "\n",
    "\n",
    "# display the dataframe\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# TODO: Get The Financial Information From Each Item\n",
    "\n",
    "The above dataframe contains the starting and end index of each match for Items 1A, 7, and 7A. In the code below, save all the text from the starting index of `item1a` till the starting index of `item7` into a variable called `item_1a_raw`. Similarly,  save all the text from the starting index of `item7` till the starting index of `item7a` into a variable called `item_7_raw`. Finally,  save all the text from the starting index of `item7a` till the end of `document['10-K']` into a variable called `item_7a_raw`. You can accomplish all of this by making the correct slices of `document['10-K']`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get Item 1a\n",
    "item_1a_raw = \n",
    "\n",
    "# Get Item 7\n",
    "item_7_raw = \n",
    "\n",
    "# Get Item 7a\n",
    "item_7a_raw = "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# TODO: Display Item 1a\n",
    "\n",
    "Now that we have each item saved into a separate variable we can view them separately. For illustration purposes we will display Item 1a, but the other items will look similar."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Display Item 1a\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that the items looks pretty messy, they contain HTML tags, Unicode characters, etc... Before we can do a proper Natural Language Processing in these items we need to clean them up. This means we need to remove all HTML Tags, unicode characters, etc... In principle we could do this using regex substitutions as we learned previously, but his can be rather difficult. Luckily, packages already exist that can do all the cleaning for us, such as **Beautifulsoup**, which will be the topic of our next lessons."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Solution\n",
    "\n",
    "[Solution notebook](applying_regexes_10ks_solution.ipynb)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
